{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {
    "colab_type": "text",
    "id": "view-in-github"
   },
   "source": [
    "<a href=\"https://colab.research.google.com/github/NeuromatchAcademy/course-content/blob/master/tutorials/W2D5_ReinforcementLearning/W2D5_Tutorial2.ipynb\" target=\"_parent\"><img src=\"https://colab.research.google.com/assets/colab-badge.svg\" alt=\"Open In Colab\"/></a>"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "colab_type": "text"
   },
   "source": [
    "# Neuromatch Academy: Week 3, Day 4, Tutorial 2\n",
    "# Learning to Act: Multi-Armed Bandits\n",
    "\n",
    "__Content creators:__ Marcelo Mattar and Eric DeWitt with help from Byron Galbraith\n",
    "\n",
    "__Content reviewers:__ Matt Krause and Michael Waskom\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "colab_type": "text"
   },
   "source": [
    "---\n",
    "\n",
    "# Tutorial Objectives\n",
    "  \n",
    "In this tutorial you will use 'bandits' to understand the fundementals of how a policy interacts with the learning algorithm in reinforcement learning.\n",
    "    \n",
    "* You will understand the fundemental tradeoff between exploration and exploitation in a policy.\n",
    "* You will understand how the learning rate interacts with exploration to find the best available action."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "colab_type": "text"
   },
   "source": [
    "---\n",
    "# Setup"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "cellView": "both",
    "colab": {},
    "colab_type": "code",
    "execution": {
     "iopub.execute_input": "2021-05-25T01:19:00.135865Z",
     "iopub.status.busy": "2021-05-25T01:19:00.135291Z",
     "iopub.status.idle": "2021-05-25T01:19:00.441612Z",
     "shell.execute_reply": "2021-05-25T01:19:00.440411Z"
    }
   },
   "outputs": [],
   "source": [
    "# Imports\n",
    "import numpy as np\n",
    "import matplotlib.pyplot as plt"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "cellView": "form",
    "colab": {},
    "colab_type": "code",
    "execution": {
     "iopub.execute_input": "2021-05-25T01:19:00.446206Z",
     "iopub.status.busy": "2021-05-25T01:19:00.445655Z",
     "iopub.status.idle": "2021-05-25T01:19:00.527723Z",
     "shell.execute_reply": "2021-05-25T01:19:00.528222Z"
    }
   },
   "outputs": [],
   "source": [
    "#@title Figure settings\n",
    "import ipywidgets as widgets       # interactive display\n",
    "%config InlineBackend.figure_format = 'retina'\n",
    "plt.style.use(\"https://raw.githubusercontent.com/NeuromatchAcademy/course-content/master/nma.mplstyle\")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "cellView": "form",
    "colab": {},
    "colab_type": "code",
    "execution": {
     "iopub.execute_input": "2021-05-25T01:19:00.539310Z",
     "iopub.status.busy": "2021-05-25T01:19:00.530568Z",
     "iopub.status.idle": "2021-05-25T01:19:00.541754Z",
     "shell.execute_reply": "2021-05-25T01:19:00.542217Z"
    }
   },
   "outputs": [],
   "source": [
    "#@title Helper functions\n",
    "np.set_printoptions(precision=3)\n",
    "\n",
    "def plot_choices(q, epsilon, choice_fn, n_steps=1000, rng_seed=1):\n",
    "  np.random.seed(rng_seed)\n",
    "  counts = np.zeros_like(q)\n",
    "  for t in range(n_steps):\n",
    "    action = choice_fn(q, epsilon)\n",
    "    counts[action] += 1\n",
    "\n",
    "  fig, ax = plt.subplots()\n",
    "  ax.bar(range(len(q)), counts/n_steps)\n",
    "  ax.set(ylabel='% chosen', xlabel='action', ylim=(0,1), xticks=range(len(q)))\n",
    "\n",
    "\n",
    "def plot_multi_armed_bandit_results(results):\n",
    "  fig, (ax1, ax2, ax3) = plt.subplots(ncols=3, figsize=(20, 4))\n",
    "  ax1.plot(results['rewards'])\n",
    "  ax1.set(title=f\"Total Reward: {np.sum(results['rewards']):.2f}\",\n",
    "          xlabel='step', ylabel='reward')\n",
    "  ax2.plot(results['qs'])\n",
    "  ax2.set(xlabel='step', ylabel='value')\n",
    "  ax2.legend(range(len(results['mu'])))\n",
    "  ax3.plot(results['mu'], label='latent')\n",
    "  ax3.plot(results['qs'][-1], label='learned')\n",
    "  ax3.set(xlabel='action', ylabel='value')\n",
    "  ax3.legend()\n",
    "\n",
    "\n",
    "def plot_parameter_performance(labels, fixed, trial_rewards, trial_optimal):\n",
    "  fig, (ax1, ax2) = plt.subplots(ncols=2, figsize=(16, 6))\n",
    "\n",
    "  ax1.plot(np.mean(trial_rewards, axis=1).T)\n",
    "  ax1.set(title=f'Average Reward ({fixed})', xlabel='step', ylabel='reward')\n",
    "  ax1.legend(labels)\n",
    "\n",
    "  ax2.plot(np.mean(trial_optimal, axis=1).T)\n",
    "  ax2.set(title=f'Performance ({fixed})', xlabel='step', ylabel='% optimal')\n",
    "  ax2.legend(labels)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "colab_type": "text"
   },
   "source": [
    "---\n",
    "# Section 1: Multi-Armed Bandits"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "cellView": "form",
    "colab": {
     "base_uri": "https://localhost:8080/",
     "height": 519
    },
    "colab_type": "code",
    "execution": {
     "iopub.execute_input": "2021-05-25T01:19:00.550253Z",
     "iopub.status.busy": "2021-05-25T01:19:00.549709Z",
     "iopub.status.idle": "2021-05-25T01:19:00.593850Z",
     "shell.execute_reply": "2021-05-25T01:19:00.593352Z"
    },
    "outputId": "5d62d730-7dd0-4aee-baaf-91a053fc9adb"
   },
   "outputs": [],
   "source": [
    "#@title Video 1: Multi-Armed Bandits\n",
    "# Insert the ID of the corresponding youtube video\n",
    "from IPython.display import YouTubeVideo\n",
    "video = YouTubeVideo(id=\"kdiXr1zsfo0\", width=854, height=480, fs=1)\n",
    "print(\"Video available at https://youtu.be/\" + video.id)\n",
    "video"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "colab_type": "text"
   },
   "source": [
    "Consider the following learning problem. You are faced repeatedly with a choice among $k$ different options, or actions. After each choice you receive a reward signal in the form of a numerical value, where the larger value is the better. Your objective is to maximize the expected total reward over some time period, for example, over 1000 action selections, or time steps.\n",
    "\n",
    "This is the original form of the k-armed bandit problem. This name derives from the colloquial name for a slot machine, the \"one-armed bandit\", because it has the one lever to pull, and it is often rigged to take more money than it pays out over time. The multi-armed bandit extension is to imagine, for instance, that you are faced with multiple slot machines that you can play, but only one at a time. Which machine should you play, i.e. which arm should you pull, which action should you take, at any given time to maximize your total payout.\n",
    "\n",
    "<img alt=\"MultiArmedBandit\" width=\"625\" height=\"269\" src=\"https://github.com/NeuromatchAcademy/course-content/blob/master/tutorials/W2D5_ReinforcementLearning/static/W2D5_Tutorial2_MultiarmedBandit.png?raw=true\">\n",
    "\n",
    "\n",
    "While there are many different levels of sophistication and assumptions in how the rewards are determined, for simplicity's sake we will assume that each action results in a reward drawn from a fixed Gaussian distribution with unknown mean and unit variance. This problem setting is referred to as the *environment*, and goal is to find the arm with the highest mean value.\n",
    "\n",
    "We will solve this *optimization problem* with an *agent*, in this case an algorithm that takes in rewards and returns actions."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "colab_type": "text"
   },
   "source": [
    "---\n",
    "# Section 2: Choosing an Action\n",
    "   \n",
    "The first thing our agent needs to be able to do is choose which arm to pull. The strategy for choosing actions based on our expectations is called a *policy* (often denoted $\\pi$). We could have a random policy -- just pick an arm at random each time -- though this doesn't seem likely to be capable of optimizing our reward. We want some intentionality, and to do that we need a way of describing our beliefs about the arms' reward potential. We do this with an action-value function\n",
    "\n",
    "\\begin{align}\n",
    "q(a) = \\mathbb{E} [r_{t} | a_{t} = a]\n",
    "\\end{align}\n",
    "\n",
    "where the value $q$ for taking action $a \\in A$ at time $t$ is equal to the expected value of the reward $r_t$ given that we took action $a$ at that time. In practice, this is often represented as an array of values, where each action's value is a different element in the array.\n",
    "\n",
    "Great, now that we have a way to describe our beliefs about the values each action should return, let's come up with a policy.\n",
    "\n",
    "An obvious choice would be to take the action with the highest expected value. This is referred to as the *greedy* policy\n",
    "\n",
    "\\begin{align}\n",
    "a_{t} = \\text{argmax}_{a} \\; q_{t} (a)\n",
    "\\end{align}\n",
    "\n",
    "where our choice action is the one that maximizes the current value function.\n",
    "\n",
    "So far so good, but it can't be this easy. And, in fact, the greedy policy does have a fatal flaw: it easily gets trapped in local maxima. It never explores to see what it hasn't seen before if one option is already better than the others. This leads us to a fundamental challenge in coming up with effective polices."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "colab_type": "text"
   },
   "source": [
    "## Section 2.1: The Exploitation-Exploration Dilemma\n",
    "\n",
    "If we never try anything new, if we always stick to the safe bet, we don't know what we are missing. Sometimes we aren't missing much of anything, and regret not sticking with our preferred choice, yet other times we stumble upon something new that was way better than we thought.\n",
    "\n",
    "This is the exploitation-exploration dilemma: do you go with you best choice now, or risk the less certain option with the hope of finding something better. Too much exploration, however, means you may end up with a sub-optimal reward once it's time to stop.\n",
    "\n",
    "In order to avoid getting stuck in local minima while also maximizing reward, effective policies need some way to balance between these two aims.\n",
    "\n",
    "A simple extension to our greedy policy is to add some randomness. For instance, a coin flip -- heads we take the best choice now, tails we pick one at random. This is referred to as the $\\epsilon$-greedy policy:\n",
    "\n",
    "\\begin{align}\n",
    "P (a_{t} = a) = \n",
    "        \\begin{cases}\n",
    "        1 - \\epsilon + \\epsilon/N    & \\quad \\text{if } a_{t} = \\text{argmax}_{a} \\; q_{t} (a) \\\\\n",
    "        \\epsilon/N        & \\quad \\text{else} \n",
    "        \\end{cases} \n",
    "\\end{align}\n",
    "\n",
    "which is to say that with probability 1 - $\\epsilon$ for $\\epsilon \\in [0,1]$ we select the greedy choice, and otherwise we select an action at random (including the greedy option).\n",
    "\n",
    "Despite its relative simplicity, the epsilon-greedy policy is quite effective, which leads to its general popularity."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "colab_type": "text"
   },
   "source": [
    "### Exercise 1: Implement Epsilon-Greedy\n",
    "\n",
    "In this exercise you will implement the epsilon-greedy algorithm for deciding which action to take from a set of possible actions given their value function and a probability $\\epsilon$ of simply chosing one at random. \n",
    "\n",
    "TIP: You may find [`np.random.random`](https://numpy.org/doc/stable/reference/random/generated/numpy.random.random.html), [`np.random.choice`](https://numpy.org/doc/stable/reference/random/generated/numpy.random.choice.html), and [`np.argmax`](https://numpy.org/doc/stable/reference/generated/numpy.argmax.html) useful here."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "colab": {},
    "colab_type": "code",
    "execution": {
     "iopub.execute_input": "2021-05-25T01:19:00.598512Z",
     "iopub.status.busy": "2021-05-25T01:19:00.597989Z",
     "iopub.status.idle": "2021-05-25T01:19:00.601591Z",
     "shell.execute_reply": "2021-05-25T01:19:00.601136Z"
    }
   },
   "outputs": [],
   "source": [
    "def epsilon_greedy(q, epsilon):\n",
    "  \"\"\"Epsilon-greedy policy: selects the maximum value action with probabilty\n",
    "  (1-epsilon) and selects randomly with epsilon probability.\n",
    "\n",
    "  Args:\n",
    "    q (ndarray): an array of action values\n",
    "    epsilon (float): probability of selecting an action randomly\n",
    "\n",
    "  Returns:\n",
    "    int: the chosen action\n",
    "  \"\"\"\n",
    "  #####################################################################\n",
    "  ## TODO for students: implement the epsilon greedy decision algorithm\n",
    "  # Fill out function and remove\n",
    "  raise NotImplementedError(\"Student excercise: implement the epsilon greedy decision algorithm\")\n",
    "  #####################################################################\n",
    "  # write a boolean expression that determines if we should take the best action\n",
    "  be_greedy = ...\n",
    "  if be_greedy:\n",
    "    # write an expression for selecting the best action from the action values\n",
    "    action = ...\n",
    "  else:\n",
    "    # write an expression for selecting a random action\n",
    "    action = ...\n",
    "\n",
    "  return action\n",
    "\n",
    "\n",
    "# Uncomment once the epsilon_greedy function is complete\n",
    "# q = [-2, 5, 0, 1]\n",
    "# epsilon = 0.1\n",
    "# plot_choices(q, epsilon, epsilon_greedy)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "colab": {
     "base_uri": "https://localhost:8080/",
     "height": 430
    },
    "colab_type": "code",
    "execution": {
     "iopub.execute_input": "2021-05-25T01:19:00.607619Z",
     "iopub.status.busy": "2021-05-25T01:19:00.606411Z",
     "iopub.status.idle": "2021-05-25T01:19:00.830866Z",
     "shell.execute_reply": "2021-05-25T01:19:00.830016Z"
    },
    "outputId": "ba6b5634-88e4-4630-c3bd-df66a9964b8c"
   },
   "outputs": [],
   "source": [
    "# to_remove solution\n",
    "def epsilon_greedy(q, epsilon):\n",
    "  \"\"\"Epsilon-greedy policy: selects the maximum value action with probabilty\n",
    "  (1-epsilon) and selects randomly with epsilon probability.\n",
    "\n",
    "  Args:\n",
    "    q (ndarray): an array of action values\n",
    "    epsilon (float): probability of selecting an action randomly\n",
    "\n",
    "  Returns:\n",
    "    int: the chosen action\n",
    "  \"\"\"\n",
    "  # write a boolean expression that determines if we should take the best action\n",
    "  be_greedy = np.random.random() > epsilon\n",
    "  if be_greedy:\n",
    "    # write an expression for selecting the best action from the action values\n",
    "    action = np.argmax(q)\n",
    "  else:\n",
    "    # write an expression for selecting a random action\n",
    "    action = np.random.choice(len(q))\n",
    "\n",
    "  return action\n",
    "\n",
    "\n",
    "q = [-2, 5, 0, 1]\n",
    "epsilon = 0.1\n",
    "with plt.xkcd():\n",
    "  plot_choices(q, epsilon, epsilon_greedy)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "colab_type": "text"
   },
   "source": [
    "This is what we should expect, that the action with the largest value (action 1) is selected about (1-$\\epsilon$) of the time, or 90% for $\\epsilon = 0.1$, and the remaining 10% is split evenly amongst the other options. Use the demo below to explore how changing $\\epsilon$ affects the distribution of selected actions."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "colab_type": "text"
   },
   "source": [
    "### Interactive Demo: Changing Epsilon\n",
    "\n",
    "Epsilon is our one parameter for balancing exploitation and exploration.  Given a set of values $q = [-2, 5, 0, 1]$, use the widget below to see how changing $\\epsilon$ influences our selection of the max value 5 (action = 1) vs the others. \n",
    "\n",
    "At the extremes of its range (0 and 1), the $\\epsilon$-greedy policy reproduces two other policies. What are they?"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "cellView": "form",
    "colab": {
     "base_uri": "https://localhost:8080/",
     "height": 462,
     "referenced_widgets": [
      "3e78de4799c44ccaa734da2f540c36fe",
      "9e9365377b244ab4838148cd1b3b1aaa",
      "4de0ca2a2525460e8a4938a14454d097",
      "afdbc7bec4c04fa6a6e9ae5ba87a3302",
      "0241f594ddf64549afe29486a34e9046",
      "10fee7825c7a4af2a7328f6f7b62294d",
      "966850380cd3447da1fd1607ccb037be"
     ]
    },
    "colab_type": "code",
    "execution": {
     "iopub.execute_input": "2021-05-25T01:19:00.852738Z",
     "iopub.status.busy": "2021-05-25T01:19:00.848885Z",
     "iopub.status.idle": "2021-05-25T01:19:01.028043Z",
     "shell.execute_reply": "2021-05-25T01:19:01.028512Z"
    },
    "outputId": "a0e01ffc-c562-460d-f2f4-226c00c0dc03"
   },
   "outputs": [],
   "source": [
    "#@title\n",
    "\n",
    "#@markdown Make sure you execute this cell to enable the widget!\n",
    "\n",
    "@widgets.interact(epsilon=widgets.FloatSlider(0.1, min=0.0, max=1.0))\n",
    "def explore_epilson_values(epsilon=0.1):\n",
    "  q = [-2, 5, 0, 1]\n",
    "  plot_choices(q, epsilon, epsilon_greedy, rng_seed=None)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "colab": {},
    "colab_type": "code",
    "execution": {
     "iopub.execute_input": "2021-05-25T01:19:01.031721Z",
     "iopub.status.busy": "2021-05-25T01:19:01.031170Z",
     "iopub.status.idle": "2021-05-25T01:19:01.034996Z",
     "shell.execute_reply": "2021-05-25T01:19:01.035528Z"
    }
   },
   "outputs": [],
   "source": [
    "#to_remove explanation\n",
    "\"\"\"\n",
    "When epsilon is zero, the agent always chooses the currently best option; it\n",
    "becomes greedy. When epsilon is 1, the agent chooses randomly.\n",
    "\"\"\";"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "colab_type": "text"
   },
   "source": [
    "---\n",
    "# Section 3: Learning from Rewards\n",
    "\n",
    "Now that we have a policy for deciding what to do, how do we learn from our actions?\n",
    "\n",
    "One way to do this is just keep a record of every result we ever got and use the averages for each action. If we have a potentially very long running episode, the computational cost of keeping all these values and recomputing the mean over and over again isn't ideal. Instead we can use a streaming mean calculation, which looks like this:\n",
    "\n",
    "\\begin{align}\n",
    "q_{t+1}(a) \\leftarrow q_{t}(a) + \\frac{1}{n_t} (r_{t} - q_{t}(a))\n",
    "\\end{align}\n",
    "\n",
    "where our action-value function $q_t(a)$ is the mean of the rewards seen so far, $n_t$ is the number of actions taken by time $t$, and $r_t$ is the reward just received for taking action $a$.\n",
    "\n",
    "This still requires us to remember how many actions we've taken, so let's generalize this a bit further and replace the action total with a general parameter $\\alpha$, which we will call the learning rate\n",
    "\n",
    "\\begin{align}\n",
    "q_{t+1}(a) \\leftarrow q_{t}(a) + \\alpha (r_{t} - q_{t}(a)).\n",
    "\\end{align}"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "colab_type": "text"
   },
   "source": [
    "## Exercise 2: Updating Action Values\n",
    "\n",
    "In this exercise you will implement the action-value update rule above. The function will take in the action-value function represented as an array `q`, the action taken, the reward received, and the learning rate, `alpha`. The function will return the updated value for the selection action."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "colab": {},
    "colab_type": "code",
    "execution": {
     "iopub.execute_input": "2021-05-25T01:19:01.039813Z",
     "iopub.status.busy": "2021-05-25T01:19:01.039214Z",
     "iopub.status.idle": "2021-05-25T01:19:01.042505Z",
     "shell.execute_reply": "2021-05-25T01:19:01.042927Z"
    }
   },
   "outputs": [],
   "source": [
    "def update_action_value(q, action, reward, alpha):\n",
    "  \"\"\" Compute the updated action value given the learning rate and observed\n",
    "  reward.\n",
    "\n",
    "  Args:\n",
    "    q (ndarray): an array of action values\n",
    "    action (int): the action taken\n",
    "    reward (float): the reward received for taking the action\n",
    "    alpha (float): the learning rate\n",
    "\n",
    "  Returns:\n",
    "    float: the updated value for the selected action\n",
    "  \"\"\"\n",
    "  #####################################################\n",
    "  ## TODO for students: compute the action value update\n",
    "  # Fill out function and remove\n",
    "  raise NotImplementedError(\"Student excercise: compute the action value update\")\n",
    "  #####################################################\n",
    "  # write an expression for the updated action value\n",
    "  value = ...\n",
    "  return value\n",
    "\n",
    "\n",
    "# Uncomment once the update_action_value function is complete\n",
    "# q = [-2, 5, 0, 1]\n",
    "# action = 2\n",
    "# print(f\"Original q({action}) value = {q[action]}\")\n",
    "# q[action] = update_action_value(q, 2, 10, 0.01)\n",
    "# print(f\"Updated q({action}) value = {q[action]}\")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "colab": {
     "base_uri": "https://localhost:8080/",
     "height": 52
    },
    "colab_type": "code",
    "execution": {
     "iopub.execute_input": "2021-05-25T01:19:01.049916Z",
     "iopub.status.busy": "2021-05-25T01:19:01.048515Z",
     "iopub.status.idle": "2021-05-25T01:19:01.051697Z",
     "shell.execute_reply": "2021-05-25T01:19:01.051223Z"
    },
    "outputId": "2f878e4e-3725-437c-b283-3a8efd6db788"
   },
   "outputs": [],
   "source": [
    "# to_remove solution\n",
    "def update_action_value(q, action, reward, alpha):\n",
    "  \"\"\" Compute the updated action value given the learning rate and observed\n",
    "  reward.\n",
    "\n",
    "  Args:\n",
    "    q (ndarray): an array of action values\n",
    "    action (int): the action taken\n",
    "    reward (float): the reward received for taking the action\n",
    "    alpha (float): the learning rate\n",
    "\n",
    "  Returns:\n",
    "    float: the updated value for the selected action\n",
    "  \"\"\"\n",
    "  # write an expression for the updated action value\n",
    "  value = q[action] + alpha * (reward - q[action])\n",
    "  return value\n",
    "\n",
    "\n",
    "q = [-2, 5, 0, 1]\n",
    "action = 2\n",
    "print(f\"Original q({action}) value = {q[action]}\")\n",
    "q[action] = update_action_value(q, 2, 10, 0.01)\n",
    "print(f\"Updated q({action}) value = {q[action]}\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "colab_type": "text"
   },
   "source": [
    "---\n",
    "# Section 4: Solving Multi-Armed Bandits\n",
    "\n",
    "Now that we have both a policy and a learning rule, we can combine these to solve our original multi-armed bandit task. Recall that we have some number of arms that give rewards drawn from Gaussian distributions with unknown mean and unit variance, and our goal is to find the arm with the highest mean.\n",
    "\n",
    "First, let's see how we will simulate this environment by reading through the annotated code below."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "colab": {},
    "colab_type": "code",
    "execution": {
     "iopub.execute_input": "2021-05-25T01:19:01.059962Z",
     "iopub.status.busy": "2021-05-25T01:19:01.058534Z",
     "iopub.status.idle": "2021-05-25T01:19:01.060599Z",
     "shell.execute_reply": "2021-05-25T01:19:01.061060Z"
    }
   },
   "outputs": [],
   "source": [
    "def multi_armed_bandit(n_arms, epsilon, alpha, n_steps):\n",
    "  \"\"\" A Gaussian multi-armed bandit using an epsilon-greedy policy. For each\n",
    "  action, rewards are randomly sampled from normal distribution, with a mean\n",
    "  associated with that arm and unit variance.\n",
    "\n",
    "  Args:\n",
    "    n_arms (int): number of arms or actions\n",
    "    epsilon (float): probability of selecting an action randomly\n",
    "    alpha (float): the learning rate\n",
    "    n_steps (int): number of steps to evaluate\n",
    "\n",
    "  Returns:\n",
    "    dict: a dictionary containing the action values, actions, and rewards from\n",
    "    the evaluation along with the true arm parameters mu and the optimality of\n",
    "    the chosen actions.\n",
    "  \"\"\"\n",
    "  # Gaussian bandit parameters\n",
    "  mu = np.random.normal(size=n_arms)\n",
    "\n",
    "  # evaluation and reporting state\n",
    "  q = np.zeros(n_arms)\n",
    "  qs = np.zeros((n_steps, n_arms))\n",
    "  rewards = np.zeros(n_steps)\n",
    "  actions = np.zeros(n_steps)\n",
    "  optimal = np.zeros(n_steps)\n",
    "\n",
    "  # run the bandit\n",
    "  for t in range(n_steps):\n",
    "    # choose an action\n",
    "    action = epsilon_greedy(q, epsilon)\n",
    "    actions[t] = action\n",
    "\n",
    "    # copmute rewards for all actions\n",
    "    all_rewards = np.random.normal(mu)\n",
    "    # observe the reward for the chosen action\n",
    "    reward = all_rewards[action]\n",
    "    rewards[t] = reward\n",
    "    # was it the best possible choice?\n",
    "    optimal_action = np.argmax(all_rewards)\n",
    "    optimal[t] = action == optimal_action\n",
    "\n",
    "    # update the action value\n",
    "    q[action] = update_action_value(q, action, reward, alpha)\n",
    "    qs[t] = q\n",
    "\n",
    "  results = {\n",
    "      'qs': qs,\n",
    "      'actions': actions,\n",
    "      'rewards': rewards,\n",
    "      'mu': mu,\n",
    "      'optimal': optimal\n",
    "  }\n",
    "\n",
    "  return results"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "colab_type": "text"
   },
   "source": [
    "We can use our multi-armed bandit method to evaluate how our epsilon-greedy policy and learning rule perform at solving the task. First we will set our environment to have 10 arms and our agent parameters to $\\epsilon=0.1$ and $\\alpha=0.01$. In order to get a good sense of the agent's performance, we will run the episode for 1000 steps."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "colab": {
     "base_uri": "https://localhost:8080/",
     "height": 431
    },
    "colab_type": "code",
    "execution": {
     "iopub.execute_input": "2021-05-25T01:19:01.067486Z",
     "iopub.status.busy": "2021-05-25T01:19:01.066922Z",
     "iopub.status.idle": "2021-05-25T01:19:01.817901Z",
     "shell.execute_reply": "2021-05-25T01:19:01.817393Z"
    },
    "outputId": "a053df3b-3923-43d5-ae2b-3baae165ec2b"
   },
   "outputs": [],
   "source": [
    "# set for reproducibility, comment out / change seed value for different results\n",
    "np.random.seed(1)\n",
    "\n",
    "n_arms = 10\n",
    "epsilon = 0.1\n",
    "alpha = 0.01\n",
    "n_steps = 1000\n",
    "\n",
    "results = multi_armed_bandit(n_arms, epsilon, alpha, n_steps)\n",
    "\n",
    "fig, (ax1, ax2) = plt.subplots(ncols=2, figsize=(16, 6))\n",
    "ax1.plot(results['rewards'])\n",
    "ax1.set(title=f'Observed Reward ($\\epsilon$={epsilon}, $\\\\alpha$={alpha})',\n",
    "        xlabel='step', ylabel='reward')\n",
    "ax2.plot(results['qs'])\n",
    "ax2.set(title=f'Action Values ($\\epsilon$={epsilon}, $\\\\alpha$={alpha})',\n",
    "        xlabel='step', ylabel='value')\n",
    "ax2.legend(range(n_arms));"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "colab_type": "text"
   },
   "source": [
    "Alright, we got some rewards that are kind of all over the place, but the agent seemed to settle in on the first arm as the preferred choice of action relatively quickly. Let's see how well we did at recovering the true means of the Gaussian random variables behind the arms."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "colab": {
     "base_uri": "https://localhost:8080/",
     "height": 430
    },
    "colab_type": "code",
    "execution": {
     "iopub.execute_input": "2021-05-25T01:19:01.837002Z",
     "iopub.status.busy": "2021-05-25T01:19:01.834666Z",
     "iopub.status.idle": "2021-05-25T01:19:02.065643Z",
     "shell.execute_reply": "2021-05-25T01:19:02.065136Z"
    },
    "outputId": "0a2516cc-1581-4c84-a840-12bb4d6e6b9a"
   },
   "outputs": [],
   "source": [
    "fig, ax = plt.subplots()\n",
    "ax.plot(results['mu'], label='latent')\n",
    "ax.plot(results['qs'][-1], label='learned')\n",
    "ax.set(title=f'$\\epsilon$={epsilon}, $\\\\alpha$={alpha}',\n",
    "       xlabel='action', ylabel='value')\n",
    "ax.legend();"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "colab_type": "text"
   },
   "source": [
    "Well, we seem to have found a very good estimate for action 0, but most of the others are not great. In fact, we can see the effect of the local maxima trap at work -- the greedy part of our algorithm locked onto action 0, which is actually the 2nd best choice to action 6. Since these are the means of Gaussian random variables, we can see that the overlap between the two would be quite high, so even if we did explore action 6, we may draw a sample that is still lower than our estimate for action 0.\n",
    "\n",
    "However, this was just one choice of parameters. Perhaps there is a better combination?\n",
    "\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "colab_type": "text"
   },
   "source": [
    "## Interactive Demo: Changing Epsilon and Alpha\n",
    "\n",
    "Use the widget below to explore how varying the values of $\\epsilon$ (exploitation-exploration tradeoff), $\\alpha$ (learning rate), and even the number of actions $k$, changes the behavior of our agent."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "cellView": "form",
    "colab": {
     "base_uri": "https://localhost:8080/",
     "height": 145,
     "referenced_widgets": [
      "c813101dc8dd4bd4bc11b2d6b7594442",
      "fce56dc235d24876a777f2fc9dae19ad",
      "143dd7b1f73d4270a514159720638a26",
      "85cbaaf1132d487a9873b40b0f64188f",
      "a9f48e48226d495eb58d872eecf9637d",
      "a984b221d6b44f07903cc9d98c45fa77",
      "03ca471c57ec42be9bbda5df149ccf3d",
      "4fdc3cf2afbe41ffa02a0111f43467cd",
      "5179b303070544999311534dae956db4",
      "bb24ba8b2450427e9e0281bf8991f638",
      "bbcec2d36dda434d92fe81103b09741c",
      "aca5f997102b47e5aff533cb52dc54f2",
      "f23daec4af404ff0bc2453f7d41fd40f",
      "fe981a5f74b9458fa36445a1d65839c7",
      "5392d2ea58f748bb85eb4547c931fb12",
      "50898d51e27f4cfcb5d63c1470420739"
     ]
    },
    "colab_type": "code",
    "execution": {
     "iopub.execute_input": "2021-05-25T01:19:02.081714Z",
     "iopub.status.busy": "2021-05-25T01:19:02.073786Z",
     "iopub.status.idle": "2021-05-25T01:19:02.118781Z",
     "shell.execute_reply": "2021-05-25T01:19:02.113827Z"
    },
    "outputId": "4acdfa6a-c941-4440-b934-8aa23b3d66e4"
   },
   "outputs": [],
   "source": [
    "#@title\n",
    "\n",
    "#@markdown Make sure you execute this cell to enable the widget!\n",
    "\n",
    "@widgets.interact_manual(k=widgets.IntSlider(10, min=2, max=15),\n",
    "                         epsilon=widgets.FloatSlider(0.1, min=0.0, max=1.0),\n",
    "                         alpha=widgets.FloatLogSlider(0.01, min=-3, max=0))\n",
    "def explore_bandit_parameters(k=10, epsilon=0.1, alpha=0.001):\n",
    "  results = multi_armed_bandit(k, epsilon, alpha, 1000)\n",
    "  plot_multi_armed_bandit_results(results)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "colab_type": "text"
   },
   "source": [
    "While we can see how changing the epsilon and alpha values impact the agent's behavior, this doesn't give as a great sense of which combination is optimal. Due to the stochastic nature of both our rewards and our policy, a single trial run isn't sufficient to give us this information. Let's run mulitple trials and compare the average performance.\n",
    "\n",
    "First we will look at differet values for $\\epsilon \\in [0.0, 0.1, 0.2]$ to a fixed $\\alpha=0.1$. We will run 200 trials as a nice balance between speed and accuracy."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "colab": {
     "base_uri": "https://localhost:8080/",
     "height": 431
    },
    "colab_type": "code",
    "execution": {
     "iopub.execute_input": "2021-05-25T01:19:02.125095Z",
     "iopub.status.busy": "2021-05-25T01:19:02.124514Z",
     "iopub.status.idle": "2021-05-25T01:19:42.570865Z",
     "shell.execute_reply": "2021-05-25T01:19:42.571338Z"
    },
    "outputId": "3a103c2e-e0cd-460d-8ccc-aa4a61a13c3c"
   },
   "outputs": [],
   "source": [
    "# set for reproducibility, comment out / change seed value for different results\n",
    "np.random.seed(1)\n",
    "\n",
    "epsilons = [0.0, 0.1, 0.2]\n",
    "alpha = 0.1\n",
    "n_trials = 200\n",
    "trial_rewards = np.zeros((len(epsilons), n_trials, n_steps))\n",
    "trial_optimal = np.zeros((len(epsilons), n_trials, n_steps))\n",
    "for i, epsilon in enumerate(epsilons):\n",
    "  for n in range(n_trials):\n",
    "    results = multi_armed_bandit(n_arms, epsilon, alpha, n_steps)\n",
    "    trial_rewards[i, n] = results['rewards']\n",
    "    trial_optimal[i, n] = results['optimal']\n",
    "\n",
    "labels = [f'$\\epsilon$={e}' for e in epsilons]\n",
    "fixed = f'$\\\\alpha$={alpha}'\n",
    "plot_parameter_performance(labels, fixed, trial_rewards, trial_optimal)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "colab_type": "text"
   },
   "source": [
    "On the left we have plotted the average reward over time, and we see that while $\\epsilon=0$ (the greedy policy) does well initially, $\\epsilon=0.1$ starts to do slightly better in the long run, while $\\epsilon=0.2$ does the worst. Looking on the right, we see the percentage of times the optimal action (the best possible choice at time $t$) was taken, and here again we see a similar pattern of $\\epsilon=0.1$ starting out a bit slower but eventually having a slight edge in the longer run.\n",
    "\n",
    "We can also do the same for the learning rates. We will evaluate $\\alpha \\in [0.01, 0.1, 1.0]$ to a fixed $\\epsilon=0.1$."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "colab": {
     "base_uri": "https://localhost:8080/",
     "height": 431
    },
    "colab_type": "code",
    "execution": {
     "iopub.execute_input": "2021-05-25T01:19:42.577734Z",
     "iopub.status.busy": "2021-05-25T01:19:42.577187Z",
     "iopub.status.idle": "2021-05-25T01:20:00.213504Z",
     "shell.execute_reply": "2021-05-25T01:20:00.213951Z"
    },
    "outputId": "7720b574-57ce-4221-8aa5-21d5d278b869"
   },
   "outputs": [],
   "source": [
    "# set for reproducibility, comment out / change seed value for different results\n",
    "np.random.seed(1)\n",
    "\n",
    "epsilon = 0.1\n",
    "alphas = [0.01, 0.1, 1.0]\n",
    "n_trials = 200\n",
    "trial_rewards = np.zeros((len(epsilons), n_trials, n_steps))\n",
    "trial_optimal = np.zeros((len(epsilons), n_trials, n_steps))\n",
    "for i, alpha in enumerate(alphas):\n",
    "  for n in range(n_trials):\n",
    "    results = multi_armed_bandit(n_arms, epsilon, alpha, n_steps)\n",
    "    trial_rewards[i, n] = results['rewards']\n",
    "    trial_optimal[i, n] = results['optimal']\n",
    "\n",
    "labels = [f'$\\\\alpha$={a}' for a in alphas]\n",
    "fixed = f'$\\epsilon$={epsilon}'\n",
    "plot_parameter_performance(labels, fixed, trial_rewards, trial_optimal)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "colab_type": "text"
   },
   "source": [
    "Again we see a balance between an effective learning rate. $\\alpha=0.01$ is too weak to quickly incorporate good values, while $\\alpha=1$ is too strong likely resulting in high variance in values due to the Gaussian nature of the rewards."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "colab_type": "text"
   },
   "source": [
    "---\n",
    "# Summary\n",
    "\n",
    "In this tutorial you implemented both the epsilon-greedy descision algorithm and a learning rule for solving a multi-armed bandit scenario. You saw how balancing exploitation and exploration in action selection is crtical in finding optimal solutions. You also saw how choosing an appropriate learning rate determines how well an agent can generalize the information they receive from rewards.\n"
   ]
  }
 ],
 "metadata": {
  "colab": {
   "collapsed_sections": [],
   "include_colab_link": true,
   "name": "W3D4_Tutorial2",
   "provenance": [],
   "toc_visible": true
  },
  "kernel": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.7.10"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 0
}
